Introduction to Bayesian Data Analysis

Michael Franke & Michael Henry Tessler

overview

three pillars of BDA

 

  • parameter inference

 

  • model comparison

 

  • model criticism

coin flips

     

Is Grandma’s old lucky coin fair?

     

How could we find out?

     

  • we flipped Grandma’s coin \(n=24\) times

  • we observed it land heads \(k = 7\) times

     

     

What now?


inferring unobservables from data

   

we want to infer latent variables, which are not directly observable, from observable data (e.g., a coin’s bias, or properties of mental processes)

we often have a clear idea of how a vector of latent variables \(\theta\) makes each possible data observation \(D\) more or less likely, i.e., a likelihood function \(P(D \mid \theta)\)

think of the likelihood function as our theory of the data-generating process

we use the likelihood function to reason “backwards” from data to latent variables


example of a likelihood function

the binomial distribution gives the probability of observing \(k\) successes in \(n\) coin flips with a bias of \(\theta\):

\[ P(k \mid \theta; n) = \binom{n}{k} \theta^{k} \, (1-\theta)^{n-k} \]
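As a quick sanity check, this likelihood can be evaluated at the observed data; a minimal Python sketch (not part of the course materials):

```python
from math import comb

def binomial_likelihood(k, n, theta):
    """P(k | theta; n): probability of k heads in n flips with bias theta."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# observed data: k = 7 heads in n = 24 flips, evaluated at theta = 0.5
print(round(binomial_likelihood(7, 24, 0.5), 4))  # ≈ 0.0206
```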

   

   

(at least) two possibilities

     

     

  1. frequentist:
    • a model consists of a likelihood function \(P(D \mid \theta)\)
    • we test a null-hypothesis (e.g., \(\theta = 0.5\)) using \(p\)-values

     

  2. Bayesian:
    • a model consists of a likelihood function \(P(D \mid \theta)\) and a prior \(P(\theta)\)
    • we calculate \(P(\theta \mid D)\) using Bayes rule

null-hypothesis significance testing

NHST in a nutshell

we fix a null hypothesis (e.g., the coin is perfectly fair: \(\theta = 0.5\))

 

the \(p\)-value gives the probability, under the null hypothesis, of an outcome at least as unlikely as the actual outcome [roughly put]

 

we fix a significance level, e.g., \(\alpha = 0.05\) (the accepted probability of falsely rejecting the null hypothesis, the \(\alpha\)-error)

 

we speak of a significant test result iff the \(p\)-value is below the pre-determined significance level

 

we conventionally reject the null hypothesis iff the test result is significant

example: fair coin?

  • we flip \(n=24\) times and observe \(k = 7\) successes
  • null hypothesis: \(\theta = 0.5\)
  • the \(p\)-value is the sum of the probabilities of all events no more likely than the observed data, under the assumption that the null hypothesis is true
    • exact test; no further idealization/approximation necessary
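Since the test is exact, the two-sided \(p\)-value can be computed directly by summing the binomial probabilities; a Python sketch (not from the course materials):

```python
from math import comb

n, k_obs, theta0 = 24, 7, 0.5

def pmf(k):
    return comb(n, k) * theta0**k * (1 - theta0)**(n - k)

# sum the probabilities of all outcomes no more likely than the observed one;
# with theta0 = 0.5 we can compare the exact integer binomial coefficients
p_value = sum(pmf(k) for k in range(n + 1) if comb(n, k) <= comb(n, k_obs))
print(round(p_value, 4))  # ≈ 0.0639 > 0.05: not significant at the 0.05 level
```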

what \(p\)-values are not

 

\(p\)-values are not to be confused with:

  • the probability that the null hypothesis is true

  • the degree of confidence that the true parameter is, say, \(\theta = 0.5\)

  • an indicator that the negation of the NH is true/probable

Bayes rule for parameter inference

Bayes rule

 

the conditional probability of \(X\) given \(Y\) is defined (if \(P(Y) \neq 0\)) as:

\[ P(X \mid Y) = \frac{P(X \cap Y)}{P(Y)} \]

 

Bayes rule derives \(P(X \mid Y)\) from \(P(Y \mid X)\):

\[ \begin{align*} P(X \mid Y)\ = \frac{P(Y \mid X) \cdot P(X)}{P(Y)} \end{align*} \]

 

version for data analysis:

\[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(D \, | \, \theta)}_{likelihood} \ \underbrace{P(\theta)}_{prior}\]

example 1: single toss

  • single coin flip with unknown success bias \(\theta \in \{0, \frac{1}{3}, \frac{1}{2}, \frac{2}{3}, 1\}\)
  • flat prior beliefs: \(P(\theta) = .2\,, \forall \theta\)

 

model likelihood \(P(D \, | \, \theta)\):

##      t=0 t=1/3 t=1/2 t=2/3 t=1
## succ   0  0.33   0.5  0.67   1
## fail   1  0.67   0.5  0.33   0

weighing in \(P(\theta)\):

##      t=0 t=1/3 t=1/2 t=2/3 t=1
## succ 0.0 0.066   0.1 0.134 0.2
## fail 0.2 0.134   0.1 0.066 0.0

       

posterior \(P(\theta \, | \, \text{heads})\) after one success:

##   t=0 t=1/3 t=1/2 t=2/3   t=1 
## 0.000 0.132 0.200 0.268 0.400
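The tables above can be reproduced with a small grid computation (a Python sketch; note that the slide values 0.132 and 0.268 reflect the rounded likelihoods 0.33 and 0.67, whereas exact fractions give 0.1333 and 0.2667):

```python
from fractions import Fraction as F

thetas = [F(0), F(1, 3), F(1, 2), F(2, 3), F(1)]
prior = [F(1, 5)] * 5          # flat prior: P(theta) = 0.2 for all theta
likelihood = thetas            # P(heads | theta) = theta for a single flip

joint = [p * l for p, l in zip(prior, likelihood)]
total = sum(joint)
posterior = [j / total for j in joint]   # normalize (Bayes rule)

print([float(p) for p in posterior])
# ≈ [0.0, 0.1333, 0.2, 0.2667, 0.4]
```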

example 2: \(n=24\), \(k=7\)

likelihood

\[ P(k \mid \theta; n) = \binom{n}{k} \theta^{k} \, (1-\theta)^{n-k} \]

prior

\[ \theta \sim \text{Beta}(1,1) \qquad \text{(i.e., uniform over } [0,1]\text{)} \]

posterior

\[ P(\theta \mid k; n) = \frac{P(\theta) \ P(k \mid \theta; n)}{P(k; n)} \]

the 95% highest density interval (HDI) is a subset \(Y\) of parameter values with \(P(Y) = .95\) such that no point outside of \(Y\) is more likely than any point within
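For this example the posterior is Beta(8, 18), so the posterior mean is \(8/26 \approx 0.308\). Both the posterior and a 95% HDI can be approximated on a fine grid, following the definition above (a Python sketch, not the course implementation):

```python
from math import comb

n, k = 24, 7
grid = [i / 1000 for i in range(1001)]

# unnormalized posterior: flat prior x binomial likelihood
post = [comb(n, k) * t**k * (1 - t)**(n - k) for t in grid]
total = sum(post)
post = [p / total for p in post]

mean = sum(t * p for t, p in zip(grid, post))

# HDI: include grid points in order of decreasing probability
# until the accumulated mass reaches 0.95
ranked = sorted(range(len(grid)), key=lambda i: -post[i])
included, mass = [], 0.0
for i in ranked:
    included.append(i)
    mass += post[i]
    if mass >= 0.95:
        break
hdi = (grid[min(included)], grid[max(included)])

print(round(mean, 3), hdi)
```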

3 pillars of BDA

estimation, comparison, prediction

parameter estimation:

\[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(\theta)}_{prior} \ \underbrace{P(D \, | \, \theta)}_{likelihood}\]

 

model comparison

\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]
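For the coin data, the Bayes factor between the point-null model \(M_0 : \theta = 0.5\) and a model \(M_1\) with a uniform prior on \(\theta\) has a closed form, since \(\int_0^1 \binom{n}{k}\theta^{k}(1-\theta)^{n-k} \, \text{d}\theta = \frac{1}{n+1}\). A Python sketch (an illustration, not from the course materials):

```python
from math import comb

n, k = 24, 7

# marginal likelihood of each model
m0 = comb(n, k) * 0.5**n   # point null: theta = 0.5
m1 = 1 / (n + 1)           # uniform prior: the integral is 1/(n+1)

bf10 = m1 / m0
print(round(bf10, 3))  # ≈ 1.939: weak evidence for the uniform-prior model
```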

 

prior predictive

\[ P(D) = \int P(\theta) \ P(D \mid \theta) \ \text{d}\theta \]

posterior predictive

\[ P(D \mid D') = \int P(\theta \mid D') \ P(D \mid \theta) \ \text{d}\theta \]
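With a Beta(1,1) prior and the observed \(k=7\), \(n=24\), the posterior is Beta(8, 18), and the posterior predictive for a replication of \(n'=24\) flips is Beta-Binomial. A Python sketch of this integral in closed form (assumed setup, standard library only):

```python
from math import comb, lgamma, exp

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, n, a, b):
    """P(k | n) when theta ~ Beta(a, b): the predictive distribution."""
    return comb(n, k) * exp(log_beta(k + a, n - k + b) - log_beta(a, b))

a, b = 1 + 7, 1 + 17   # posterior Beta(8, 18) after k=7 heads in n=24 flips
pp = [beta_binomial_pmf(k, 24, a, b) for k in range(25)]

# expected number of heads in the replication: n' * a / (a + b) = 24 * 8/26
print(round(sum(k * p for k, p in zip(range(25), pp)), 3))  # ≈ 7.385
```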

3 pillars of BDA

     

|                     | estimation                             | comparison                           | criticism                       |
| goal                | which \(\theta\), given \(M\) & \(D\)? | which is better: \(M_0\) or \(M_1\)? | is \(M\) a good model of \(D\)? |
| methods             | Bayes rule                             | Bayes factor, cross-validation       | \(p\)-values, PPCs              |
| computational tools | MCMC, variational Bayes                | Savage-Dickey, bridge sampling       | MC sampling                     |

why go Bayes?

common criticism of NHST

 

  • looks at point estimates only, therefore ignores important information

 

 

  • bag of magic tricks
    • encourages thinking of statistics as finding and applying the appropriate test


pros & cons of BDA

 

pro

  • well-founded & totally general
  • more informative / insightful
  • stimulates view: “models as tools”
  • easily extensible / customizable


       

con

  • less ready-made, more hands-on
  • not yet fully digested by community
  • possibly computationally complex
  • requires thinking (wait, that’s a pro!)
